Writing a python library for the Ziggy data serialization language

Ziggy is a nice file format that I find friendly and sane. I quite like:

  • that it is simple and familiar,
  • the prefix dot notation coming from Zig, so keys are syntactically different from strings,
  • the schema, so much more sound and readable than the JSON schema nonsense,
  • named structs and tagged literals giving context to human readers and machine parsers,
  • unions that play very nice with typed languages.

An example

Consider this JSON you get when asking Messenger for your data:

[
  {
    "sender_name": "Louis",
    "timestamp_ms": 1276134211232,
    "content": "Hey"
  },
  {
    "sender_name": "Piana",
    "timestamp_ms": 1276134252080,
    "photos": [
      {
        "uri": "messages/inbox/...",
        "creation_timestamp": 1276134252
      }
    ],
    "reactions": [
      { "reaction": "😱", "actor": "Louis" },
      { "reaction": "😄", "actor": "Piana" }
    ]
  },
  <...>
]

You don’t know what to expect from one message to the next, and must parse all fields. If there are photos, there is no content, and conversely. Maybe you can add a type field to messages and switch on it, but is that any better than reflecting on the presence of fields with special names?

When I printed our conversations with my SO as a book, I had a large, nested Go struct that tried to reproduce all the possible field layouts. In hindsight, I would have done better to write a custom decoder, but I was not as well-trained as I am today, and Christmas was coming…

Ziggy can do better. The above conversation becomes:

[
    TextMessage {
        .sender_name = "Louis",
        .timestamp = @unixtimestamp_ms("1276134211232"),
        .content = "Hey",
    },
    PhotoMessage {
        .sender_name = "Piana",
        .timestamp = @unixtimestamp_ms("1276134252080"),
        .photos = [
            Photo {
                .uri = "messages/inbox/...",
                .creation_timestamp = @unixtimestamp("1276134252"),
            }
        ],
        .reactions = [
            Reaction { .reaction = "😱", .actor = "Louis" },
            Reaction { .reaction = "😄", .actor = "Piana" },
        ]
    },
]

The structure is very similar, but we have named structs for additional context, and we traded bare integers for tagged string literals for the timestamps, giving hints to a parser. We can thus rename timestamp_ms to timestamp, as the suffix would now be redundant.

And it really shines if we define a schema for the document:

root = [Message]

@unixtimestamp = bytes,
@unixtimestamp_ms = bytes,

Message = TextMessage | PhotoMessage

struct TextMessage {
    sender_name: bytes,
    timestamp: @unixtimestamp_ms,
    content: bytes,
    reactions: ?[Reaction],
}

struct PhotoMessage {
    sender_name: bytes,
    timestamp: @unixtimestamp_ms,
    photos: [Photo],
    reactions: ?[Reaction],
}

struct Photo {
    uri: bytes,
    creation_timestamp: @unixtimestamp,
}

struct Reaction {
    reaction: bytes,
    actor: bytes,
}

It’s short, readable, and grasped in under a minute.

With a schema, it’s easy to validate a document, to get tooling that helps you edit it by hand, and to serialize a valid object from memory, or fail if it isn’t valid, provided your library knows about the schema.

If we have the schema and parse into a typed language, it is probably also unnecessary to prefix structs with their names. This makes the document terser at the expense of standalone human readability.

I like Ziggy and would like to see it widespread.

Ser/deserialization of Ziggy documents in Python

So, after porting sqids to Zig, I thought a fun follow-up would be porting Ziggy to Python. And, having recently written some extensions for Zed, I had familiarized myself with tree-sitter, so parsing felt less scary.

Ziggy document -> Python object

I started writing a tokenizer and a parser in Python following Nystrom’s Crafting Interpreters. Even though it’s just data, I thought it was a good pet project to finally finish reading the book, as implementing Lox never really excited me.

But (unfortunately?), I realized shortly after starting that Loris was kind enough to provide tree-sitter grammars for both Ziggy documents and Ziggy schemas! So I ditched my clumsy in-house parser and leveraged the tree-sitter ones.

After generating Python bindings for the tree-sitter Ziggy grammar, you can parse any Ziggy document from Python and get an AST. You are halfway towards an in-memory Python object! All that remains is to walk the tree and produce Python literals and data containers as you go. Tree-sitter is such a great piece of software.

I walk the tree recursively, as it feels the most natural. I considered using the tree cursor API, but I struggled to envision how to use it properly. Besides, a great feature of tree-sitter is named nodes. Used properly in the grammar, they help AST consumers quickly discriminate between important, data-bearing nodes and the boring stuff. And with the cursor API, I am still unsure how to visit only named nodes.

Ziggy values have straightforward counterparts in Python. The only place where I really had to make a choice was how to represent structs. A dataclass is the most natural object, so I went with that.

class Parser:
    def interpret(self, node: ts.Node | None) -> object:
        if node is None:
            return None
        match node.type:
            case "document":
                return self.interpret(node.child(0))
            case "true":
                return True
            case "false":
                return False
            case "null":
                return None
            case "integer":
                assert node.text is not None
                return int(node.text)
            case "float":
                assert node.text is not None
                return float(node.text)
            case "identifier":
                return self.interpret_identifier(node)
            case "string":
                return self.interpret_string(node)
            case "quoted_string":
                return self.interpret_quoted_string(node)
            case "tag_string":
                return self.interpret_tag_string(node)
            case "map":
                return self.interpret_map(node)
            case "array":
                return self.interpret_array(node)
            case "struct" | "top_level_struct":
                return self.interpret_struct(node)

    [...]

    def interpret_tag_string(self, node: ts.Node) -> object:
        name = node.named_children[0].text
        assert name is not None
        name = name.decode("utf-8")
        v = self.interpret_quoted_string(node.named_children[1])
        if name in self.literals:
            f = self.literals[name]
            return f(v)
        return v

    def interpret_map(self, node: ts.Node) -> dict[str, object]:
        result = {}
        for c in node.named_children:
            key_node = c.child_by_field_name("key")
            assert key_node is not None
            k = self.interpret_quoted_string(key_node)
            v = self.interpret(c.child_by_field_name("value"))
            result[k] = v
        return result

    [...]

Now, the most attractive features of Ziggy must be supported, namely named structs and tagged string literals. Loris gives good pointers for parsing from dynamic languages such as Python. And indeed, a mapping of strings to constructors is enough to handle both named structs and tagged literals:

  • for tagged literals, the constructor accepts a string (the Ziggy document one) and returns any python object;
  • for named structs, the constructor accepts a dataclass (created by the parser) and returns any python object.

class Parser:
    def __init__(
        self,
        *,
        literals: Mapping[str, Callable] | None = None,
        structs: Mapping[str, Callable] | None = None,
    ):
        self.literals: dict = dict(literals) if literals is not None else {}
        self.structs: dict = dict(structs) if structs is not None else {}
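For tagged literals, a constructor is just a function from the literal’s string payload to any Python object. For instance, a hypothetical converter for @unixtimestamp_ms (the function name and usage are my own; only the Parser signature above is the library’s):

```python
from datetime import datetime, timezone

def unixtimestamp_ms(s: str) -> datetime:
    # The tag's payload is a millisecond Unix timestamp as a string.
    return datetime.fromtimestamp(int(s) / 1000, tz=timezone.utc)

# Usage sketch, assuming the Parser class above:
# parser = Parser(literals={"unixtimestamp_ms": unixtimestamp_ms})
```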

Python object -> Ziggy document

Now, the converse.

The case of literals is easy: it’s basically calling str() on them. Then, I was afraid things would get very tricky given the mess of object types you can encounter in Python, where you never really know what type you have at hand, both in the source code and at runtime. But in fact, it’s not so bad:

  • Python Sequence types are serialized as Ziggy lists,
  • Python Mapping types are serialized as Ziggy maps,
  • and beyond that, we only support dataclasses, which are serialized as Ziggy structs.
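A minimal sketch of that dispatch (my own illustration under the rules above, not the library’s implementation) could look like:

```python
from collections.abc import Mapping, Sequence
from dataclasses import dataclass, fields, is_dataclass

def to_ziggy(obj) -> str:
    """Serialize a Python value to Ziggy text (illustrative sketch)."""
    if obj is None:
        return "null"
    if isinstance(obj, bool):  # check bool before int: bool is a subclass of int
        return "true" if obj else "false"
    if isinstance(obj, (int, float)):
        return str(obj)
    if isinstance(obj, str):  # check str before Sequence: str is a Sequence
        return '"' + obj.replace('"', '\\"') + '"'
    if is_dataclass(obj):
        body = ", ".join(
            f".{f.name} = {to_ziggy(getattr(obj, f.name))}" for f in fields(obj)
        )
        return f"{type(obj).__name__} {{ {body} }}"
    if isinstance(obj, Mapping):
        body = ", ".join(f'"{k}": {to_ziggy(v)}' for k, v in obj.items())
        return f"{{ {body} }}"
    if isinstance(obj, Sequence):
        return "[" + ", ".join(to_ziggy(v) for v in obj) + "]"
    raise TypeError(f"unsupported type: {type(obj).__name__}")

@dataclass
class Reaction:
    reaction: str
    actor: str
```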

But what about classes?! And sets? And my pydantic object!?

Well, I don’t want to confront the complexity of attributes, properties, fancy inheritance and stateful objects, so I simply don’t support them. And frankly, if you want to serialize generic blobby objects rather than concrete data whose structure is suited to your target format, that’s on you.

Okay, but what about datetime objects from the stdlib? Or UUIDs? Or my custom enum variants?

Ah, indeed, these are things we want to support and make easy. So we want the serializer to be configurable, so that the user can control how certain types serialize. Most probably one wants them as tagged string literals, as in @date("2024-12-12"), @unixtimestamp("1733958589") or @status("cancelled").
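Such converters could be ordinary functions from a Python value to its tagged-literal text (hypothetical examples of my own, not the library’s built-ins):

```python
from datetime import date
from uuid import UUID

# Hypothetical user-supplied converters, one per type:
def serialize_date(d: date) -> str:
    return f'@date("{d.isoformat()}")'

def serialize_uuid(u: UUID) -> str:
    return f'@uuid("{u}")'
```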

Publish a cleanish package

“The work is never the work”, or “finishing is a whole new project”… call it what you want, but a significant part of this project was getting a nice, handy interface that feels clean, writing some tests, and neatly packaging the whole thing for the Python ecosystem.

Handy interface

The serializer must have ways to:

  • pick between quoted and multiline strings,
  • give a name to a Ziggy struct,
  • serialize an object as a tagged quoted string literal.

For flexibility, the load of serialization can be put onto the serializer or the objects themselves:

  • At the serializer level:
    When instantiating a serializer, you can pass a list of functions, each associating a type with a serialization function.
  • At the object level:
    Custom classes can implement the ZiggySerializer abstract base class through the ziggy_serialize method, so that they serialize themselves to Ziggy.
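A sketch of that object-level hook (the class and method names come from the library, but the exact signature is my assumption):

```python
from abc import ABC, abstractmethod

class ZiggySerializer(ABC):
    """Objects implementing this serialize themselves to Ziggy text."""

    @abstractmethod
    def ziggy_serialize(self) -> str: ...

class Status(ZiggySerializer):
    # Hypothetical example class rendering itself as a tagged literal.
    def __init__(self, value: str):
        self.value = value

    def ziggy_serialize(self) -> str:
        return f'@status("{self.value}")'
```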

The deserializer must have ways to:

  • associate struct name to dataclasses,
  • parse tagged literals.

This is done at the deserializer level only. It feels too complex to have an approach like Go’s, where you pass a pointer to an object satisfying the json.Unmarshaler interface, say.
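Concretely, associating a struct name with a dataclass could look like this (a usage sketch assuming the Parser class shown earlier; the lambda shape is my assumption):

```python
from dataclasses import dataclass

@dataclass
class Reaction:
    reaction: str
    actor: str

# Hypothetical usage: the constructor receives the parser-built dataclass
# and returns whatever object you want in its place.
# parser = Parser(structs={"Reaction": lambda d: Reaction(d.reaction, d.actor)})
```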

Clean, “pythonic” feeling

I don’t know yet, I’ll have to use it on the daily.

I have one major issue now, though: the naming of the functions. I despise load(s) and dump(s) from the stdlib json or tomllib, yet they are the natural candidates. Right now I have parse and deserialize, but this makes me unhappy. We will see if I can find better names.

Tests?

I wrote some tests, but I do not want to make the test suite too heavy while Ziggy is not even at 0.0.1.

Python ecosystem 🔥

This is where the real mess begins.

The parser produced by tree-sitter is in C, with a thin layer of Python bindings. I am using uv (thank god for Charlie Marsh and Armin Ronacher) for managing Python. And while it’s amazing, uv publish was unable to compile wheels. To be fair, I was unable to compile wheels locally at all. Of course, this was a C toolchain issue where paths and options were incorrectly set up, but I was unable to fix it. Salvation came only after noticing that any new tree-sitter project includes a .github/workflows/wheels.yml by default. There, the cibuildwheel package is leveraged to handle everything. So, I build wheels with GitHub Actions, grab the artifacts, and publish them manually with twine.

But, somehow, I had created an account on test.pypi.org and not pypi.org, which gave me lots of headaches when trying to push a wheel. This weird subdomain is there to “try distribution tools”. Who wants that?! To me this seems a consequence of having a global index of bare names without any namespacing. Conflicts and name-squatting are unavoidable, hence the necessity of a sandboxed test space like that, I guess.

Go’s package index is much sounder in this respect, with third-party module names being the URL of the hosted git repo. Compared to PyPI, it does not even require an authentication system! You can dive deeper here.

And that’s finally it! With ziggy-tree-sitter and ziggy-python both on PyPI, you can try using Ziggy with Python.